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Abstract 

Recent advances in next-generation sequencing technologies have made it possible to generate large amounts of 
sequence data with rare variants in a cost-effective way. Statistical methods that test variants individually are 
underpowered to detect rare variants, so it is desirable to perform association analysis of rare variants by 
combining the information from all variants. In this study, we use a Bayesian regression method to model all 
variants simultaneously to identify rare variants in a data set from Genetic Analysis Workshop 17. We studied the 
association between the quantitative risk traits Q1, Q2, and Q4 and the single-nucleotide polymorphisms and 
identified several positive single-nucleotide polymorphisms for traits Q1 and Q2. However, the model also 
generated several apparent false positives and missed many true positives, suggesting that there is room for 
improvement in this model. 



Background 

Rare variants are genetic variants that have a minor 
allele frequency (MAF) less than 1%. Many previous stu- 
dies have suggested that rare variants generally have lar- 
ger effects on a trait than common variants. Therefore 
identification of rare variants has become an important 
research topic in recent genome-wide association stu- 
dies. Several statistical approaches have been developed 
to tackle this problem. These methods include the 
weighted sum statistic [1], combined multivariate and 
collapsing [2], the comparison of rare variants found 
exclusively in case subjects to those found only in con- 
trol subjects [3,4], and the kernel-based adaptive cluster 
[5]. Overall, the results observed from these studies indi- 
cate that multiple rare variants collectively contribute to 
the variations of the trait, suggesting that it is desirable 
to use all variants together to identify the associated 
genetic variants with a given phenotype. 

Bayesian regression models have been used in animal 
breeding to predict breeding values based on all avail- 
able single-nucleotide polymorphisms (SNPs) [6]. Many 
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extensions of Bayesian regression models in this field 
have been discussed by Gianola et al. [7]. Bayesian sto- 
chastic variable selection methods have also provided an 
alternative approach to genome-wide association studies 
[8,9]. Srivastava and Chen [10] compared the perfor- 
mance of a Bayesian stochastic variable selection 
method with that of a penalized sparse regression 
method and demonstrated that the Bayesian stochastic 
variable selection outperformed the sparse regression 
and also the single-SNP-based method. Yi and Zhi [11], 
in a recent study, used Bayesian stochastic variable 
selection for the identification of rare variants. However, 
the studies by Srivastava and Chen [10] and Yi and Zhi 
[11] did not estimate the probability that the SNP will 
be associated with the phenotype given the data. 

In the current study, we model common variants and 
rare variants simultaneously using a Bayesian stochastic 
variable selection method. We calculate the regression 
coefficient and posterior probability of association of 
each SNP and use them to measure the association 
between each SNP and the given trait. The difference 
between our method and those of Srivastava and Chen 
[10] and Yi and Zhi [11] is that we estimate the poster- 
ior probability that the SNP will be associated with the 
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phenotype and use that probability to estimate the num- 
ber of SNPs associated with the given trait. We apply 
this method to study the association for the quantitative 
risk factors Ql, Q2, and Q4 in the Genetic Analysis 
Workshop 17 (GAW17) data and successfully identify 
several SNPs associated with the Ql and Q2 traits. 

Methods 

Overall model 

First, let us introduce the model and some notation. The 
model is: 



Ynxi - A* Inxi + X nX p(O px i o a pxl 
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where s °° N(0, a 2 * I) , n is the number of individuals, 
p is the number of SNPs, y is an n x 1 phenotype vec- 
tor, X is an n x p matrix with entries being 0, 1, and 2 
encoded for the genotypes AA, AB, and BB, respectively, 
9 is a p x 1 latent variable vector with entries being 0 
and 1 to perform variable selection, and a is a p x 1 
regression coefficient vector. 9 ° a indicates the ele- 
ment-wise product between 9 and a. If 6j ■ = 1, then a ; 
(SNP /) is included in the model; if Oj - 0, then a ; is 
excluded from the model. P(6j = 1) = n is the prior 
probability that the SNP will be associated with the phe- 
notype in question, where n is the same for all SNPs. 
We assume that the prior probability for k is Binomial 
(B(p, 7i)), where k is the number of SNPs that are asso- 
ciated with a phenotype. a ; is normally distributed with 
mean 0 and variance a 2 , and a 2 is the scaled inverse 
chi-square distribution with scale parameter S a and 
degrees of freedom v a . <j 2 E follows the scaled inverse 
chi-square distribution with scale parameter S e and 
degrees of freedom v s . 

Posterior estimations 

Based on Eq. (1), one can obtain the full conditional 
probability functions (FCPFs) for the parameters (deriva- 
tions not shown). The FCPF for <j 2 is the scaled 
inverse chi-square distribution with scale parameter: 



v s +n 

and degrees of freedom v e + n, where: 
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The FCPF for ft is the normal distribution with mean: 

iixnCy-Xa) ( 4 ) 
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and variance ^ 2 j n . 

The FCPF for a, is the normal distribution with mean 
ctj and variance cr s / C ; - > where: 

n i = X J- 
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and: 
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It is clear that the a j is conditional on all other a / . 
The FCPF for cr^ of each locus is the scale inverse 
chi-square distribution with scale parameter: 



kaa + v a S a 
k + v„ 
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and degrees of freedom k + v a > where ^ is a vector in 
which the a j is not 0. To obtain % , we need to decide 
whether each SNP should be included in the model or 
not. To make this decision, we need to calculate: 



f[ai\e] = o)f(e] = o) 
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In expression (8), aj indicates that the a ; are not in 
the model in the 5th Markov chain Monte Carlo 

= 0 ) is the likeli- 



(MCMC) iteration; and fia-_\6- . ; 
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iteration and is given by: 
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In addition, a ■ indicates that the a ; are in the model 
in the 5th MCMC iteration; and /|a||^ ; 5 =1 j is the 
likelihood that the a ; are in the model in the 5th 
MCMC iteration and is given by: 



In expression (2), x ; is an n x 1 vector that corre- 
sponds to the SNP y. 
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In expression (8), f{9- = 1) = n , and the posterior 
distribution for ji is the Beta distribution 
(Beta (1 + k, 1 + p - k) ). From these likelihoods and the 
sampled n, we compute the value of expression (8), and 
use this value to determine whether 9- is 1 or 0. Pos- 
terior estimations are based on the samples from the 
given FCPFs using MCMC sampling. 

MCMC sampling 

MCMC sampling works as follows. For each MCMC 
iteration, we first sample cr^ from its FCPF and ft from 
its FCPF. Next, for a lt a 2 , cc 3) GCp a p , we sample 
from the FCPF for a ; . Whether a ; is included in the 
model or not is determined, and <jj is updated by ~ 2 . 
% is estimated. Next we sample from the FCPF for a\ . 
Finally, we sample tt from Beta (1 + fe, 1 + p - fe) . 

We performed 15,000 MCMC iterations and used the 
first 1,000 iterations as the burn-in period. The inclu- 
sion probability for a ; is based on the proportion of 9j = 
1 in all the MCMC samples after the burn-in period. 
This probability is used as the posterior probability of 
association (PPA) for each SNP. 

Data set 

The GAW17 data set includes 697 unrelated individuals; 
each individual has 24,487 SNPs. The MAFs of the 
SNPs range from 0.0717% to 49.9283% [12]. Our analy- 
sis is based on quantitative traits Ql, Q2, and Q4. The 
GAW17 answers show that Ql is associated with 39 
SNPs in 9 genes from the vascular endothelial growth 
factor (VEGF) pathway, that Q2 is influenced by 72 
SNPs in 13 genes related to cardiovascular risk and 
inflammation, and that Q4 is not affected by any of the 
available SNPs. There are 200 data replications for each 
trait. We perform an analysis for each replication and 
obtain the average regression coefficients and PPAs for 
each SNP from the 200 data replications. We then use 
the different cutoffs of the regression coefficients and 
PPAs to compute a series of true-positive rates (TPRs) 
and false-positive rates (FPRs). We use the receiver 
operating characteristic (ROC) to compare the TPR and 
the FPR as the cutoffs change and the area under the 
curve (AUC) to measure the performance of the model. 

Results and discussion 

We analyzed the association between quantitative traits 
Ql, Q2, and Q4 and the SNPs in the GAW17 data. For 
each trait, we assigned a relative rank for each SNP on 
the basis of the sorting of the absolute values of the 
average regression coefficients and PPAs of all SNPs in 
decreasing order. In our model, we estimated the num- 
ber of SNPs associated with a trait. Using this number 
( }i ), we identified the SNPs whose ranks are within the 
range of this number. 



Table 1 Association analysis for trait Ql 


Gene SNP 


Regression coefficient 


PPA 


MAF 


FLT1 C13S431 


2.501 (2) 


0.243 (3) 


0.017217 


C13S522 


2.188 (3) 


0.264 (2) 


0.027977 


C13S523 


9.027 (1) 


0.998 (1) 


0.066714 


ARNT C1S6533 


0.478 (6) 


0.043 (4) 


0.011478 


KDR C4S1884 


0.126 (42) 


0.016 (8) 


0.020803 



The third and fourth columns are the average regression coefficients and 
posterior probabilities of association (PPAs) out of 200 replications, 
respectively. The numbers in parentheses for the regression coefficients 
indicate the rank of the SNP based on sorting the absolute values of average 
regression coefficients in decreasing order. The numbers in parentheses for 
the PPAs indicate the rank of the SNP based on sorting the PPAs in 
decreasing order. We use the same notation in Table 2. 

For the Ql trait, the range of the estimated number of 
SNPs associated with Ql is 3 to 8. Based on this range, 
the SNPs whose average regression coefficients and 
PPAs are within the top eight rankings are considered 
associated with Ql (Table 1). The ranks of C13S431, 
C13S522, C13S523, C1S6533, and C4S1884 are within 
the top eight. The GAW17 answers confirmed that all 
five SNPs are truly associated with Ql. C13S431, 
C13S522, and C13S523 are located in gene FLT1, 
C1S6533 is located in gene ARNT, and C4S1884 is 
located in gene KDR. 

For the Q2 trait, the range of the estimated number of 
SNPs associated with the Q2 is 2 to 6. Based on this 
range, the SNPs whose average regression coefficients 
and PPAs are within the top six rankings are considered 
associated with Q2 (Table 2). We found that the ranks 
of C6S5380, C6S5449, C6S5441, C8S442, and C10S3050 
are within the top six. C6S5380 is located in gene 
VNN1, C6S5449 and C6S5441 are in gene VNN3, 
C8S442 is in gene LPL, and C10S3050 is a rare variant 
in gene SIRT1 with a MAF = 0.002152. The GAW17 
answers confirmed that all five SNPs are truly associated 
with Q2. In our analysis, C10S3051 is also identified as 
being associated with Q2. Compared with the GAW17 
answers, this finding is a false -positive association. How- 
ever, we found that C10S3051 is a synonymous muta- 
tion and is identical to C10S3050. The positions of the 
two SNPs are close together, suggesting that the two 
SNPs may be in high linkage disequilibrium. 



Table 2 Association analysis for trait Q2 



Gene 


SNP 


Regression coefficient 


PPA 


MAF 


VNN1 


C6S5380 


0.251 (1) 


0.077 (1) 


0.170732 


VNN3 


C6S5449 


0.194 (2) 


0.015 (3) 


0.010043 




C6S5441 


0.038 (25) 


0.010 (4) 


0.098278 


LPL 


C8S442 


0.152 (5) 


0.016 (2) 


0.015782 


SIRT1 


C10S3050 


0.170 (3) 


0.008 (6) 


0.002152 



See Table 1 notes for an explanation of the notation. 
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For the Q4 trait, the range of the estimated number of 
SNPs associated with Q4 is 3 to 6. Based on this range, 
the SNPs whose average regression coefficients and PPAs 
are within the top six rankings are considered associated 
with Q4. Compared with the GAW17 answers, these 
SNPs are false positives. We observed that the correlation 
coefficient between Ql and Q4 is -0.293 and that there 
are 39 SNPs that are associated with Ql, which could be 
the reason for these false positives. 

The results for Ql and Q2 demonstrate that our 
model identified several true SNPs associated with Ql 
and Q2 but missed many true positives for these two 
traits. To assess the model's performance for identifying 
rare variants, based on the association results using all 
SNPs, we extract the SNPs with a MAF less than 1% 
and calculate the AUCs using all SNPs and using the 
rare variants only for Ql and Q2. Table 3 shows that 
the AUCs using all SNPs range from 0.774 to 0.808; the 
AUCs using only the rare variants range from 0.699 to 
0.724. We obtained a reasonable power to detect rare 
variants using this model. However, it is obvious that 
the power of our model to identify rare variants is less 
than the power to identify common variants. These 
results could be due to the small effects of SNPs with 
lower MAFs, and our model shrinks their regression 
coefficients to 0. 

Several other factors could also have played a role in 
causing the false negatives and false positives. First, we 
observed that there is an outlier for the Ql trait. Several 
studies have shown that removing this outlier might 
increase the detection power. Our analyses did not con- 
sider these outliers, so we expected that we could 
increase the power by removing the outliers in the sub- 
sequent analyses. Second, the structure information of 
the SNPs was not included in the model, although all 
SNPs were modeled simultaneously. Many studies have 
shown that collapsing SNPs into blocks based on linkage 
disequilibrium, a gene, or a biological pathway can 
increase the power to detect associations. In a future 
study, we plan to model the correlations between the 
SNPs or to collapse the SNPs into blocks first and then 



Table 3 AUC of the model using all SNPs and the rare 



variants only for traits Ql and Q2 



Trait 


AUC based on regression coefficients 


AUC based on PPAs 


Q1 


0.791 (all SNPs ) 


0.808 (all SNPs) 




0.699 (RV only) 


0.724 (RV only) 


Q2 


0.776 (all SNPs) 


0.774 (all SNPs) 




0.724 (RV only) 


0.718 (RV only) 



"All SNPs" indicates our analysis based on all SNPs; "RV only" indicates our 
analysis based on only the rare variants (MAF < 0.01). 



apply this model to the blocks to see whether the detec- 
tion power of this method can be increased. 

Conclusions 

In the present study, we modeled all SNPs simulta- 
neously to study the association between the SNPs and 
the quantitative risk traits Ql, Q2, and Q4 using a Baye- 
sian regression method. Some true associated SNPs for 
Ql and Q2 were identified using this method. However, 
our model missed many true positives and generated 
several false positives, suggesting that there is room for 
improvement. 
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